{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Uniform Manifold Approximation and Projection (UMAP)\n",
    "\n",
    "UMAP is a dimensionality reduction algorithm which performs non-linear dimension reduction. It can also be used for visualization. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model can take array-like objects, either in host as NumPy arrays or in device (as Numba or cuda_array_interface-compliant), as well as cuDF DataFrames as the input. \n",
    "\n",
    "In order to convert your dataset to cudf format please read the cudf documentation on https://docs.rapids.ai/api/cudf/stable.\n",
    "\n",
    "For additional information on the UMAP model please refer to the documentation on https://rapidsai.github.io/projects/cuml/en/stable/api.html#cuml.UMAP"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import numpy as np\n",
    "\n",
    "import pandas as pd\n",
    "import cudf as gd\n",
    "\n",
    "from sklearn import datasets\n",
    "\n",
    "from sklearn.metrics import adjusted_rand_score\n",
    "from sklearn.cluster import KMeans\n",
    "\n",
    "from sklearn.manifold.t_sne import trustworthiness\n",
    "\n",
    "from cuml.manifold.umap import UMAP as cumlUMAP"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_samples = 500\n",
    "n_features = 10\n",
    "n_centers = 5\n",
    "\n",
    "n_neighbors = 10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data, labels = datasets.make_blobs(n_samples=n_samples, \n",
    "                                   n_features=n_features, \n",
    "                                   centers=n_centers)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fit Embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cuml_umap = cumlUMAP()\n",
    "embedding = cuml_umap.fit_transform(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate Neighborhoods"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calculate the score of the results obtained using cuml's algorithm and sklearn k-means. A score of 1.0 means the labels in our embedding match the original labels (thus preserving local neighborhood structure well)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adjusted_rand_score(labels, KMeans(n_centers).fit_predict(embedding))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Iris Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris = datasets.load_iris()\n",
    "data = iris.data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fit Embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%%time\n",
    "cuml_umap = cumlUMAP(n_neighbors=n_neighbors, \n",
    "                     min_dist=0.01)\n",
    "\n",
    "embedding = cuml_umap.fit_transform(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate Trustworthiness\n",
    "\n",
    "Trustworthiness is a measure of how well an embedding preserves local neighborhood structure. It uses the nearest neighbors of the input vectors to rank the neighbors of the output vectors. Large divergences in neighborhoods between input and output vectors lower the score. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trustworthiness(iris.data, embedding, 10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Split Train / Test Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a selection variable which will have 75% training and 25% testing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris_selection = np.random.choice(\n",
    "    [True, False], 150, replace=True, p=[0.75, 0.25])\n",
    "\n",
    "data = iris.data[iris_selection]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cuml_umap = cumlUMAP(n_neighbors=n_neighbors, min_dist=0.01, verbose=False)\n",
    "cuml_umap.fit(data)\n",
    "\n",
    "# create a new iris dataset by inverting the values of the selection variable (ie. 75% False and 25% True values) \n",
    "new_data = iris.data[~iris_selection]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Predict Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "embedding = cuml_umap.transform(new_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate Trustworthiness\n",
    "\n",
    "Evaluating the trustworthiness on predictions from unseen data gives an indication of UMAP's ability to map the unseen data onto the manifold constructed from the training data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trustworthiness(new_data, embedding, 10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
